Slow Programming & Close Reading

Introduction

The rationale of this project lies in what still feels to me like an uncomfortable clash between hermeneutics and distant reading methods. We understand and accept that quantitative approaches can tell us a lot about texts (cf. e.g. Van Dalen-Oskam & Van Zundert 2007; Kestemont 2012; Ramsay 2011). At the same time, well-known practitioners of such methods tell us that in the end the patterns that emerge from number crunching and pattern recognition require hermeneutic interpretation to be given meaning (Hayles 2012:38; Underwood & Sellers 2015; Meister 1995). I assert that DH research shows a strong predilection to treat this matter as a dichotomy: it seemingly has to be one of two options. Patterns can be modeled and quantified, but this necessarily results in reductive measures that lose the subtle distinctions that drive hermeneutic interpretation. The gain of this coarse reductiveness is stringent formalization, which ensures computational tractability and therefore the scale and power of the analysis of large numbers: we can measure into corpora without even looking at them with human eyes and intellect. The other option is to apply subtle hermeneutics through close reading of a text. This gives the research the power of meticulous interpretation, of precise contextualizing of meaning, of capturing, representing, and interpreting complex heterogeneous knowledge. The loss here, however, is the power of scale: such hermeneutic precision cannot be expressed by simple numbers, and such heterogeneity cannot be modeled at scale. Therefore the hermeneutic model is a model of a single text or a few texts, and a model without computation: it is only interpretation. The quantitative model stands in opposition: it counts, and by using the computer it counts eerily fast and into huge corpora, but its understanding of the individual texts in those corpora is poor.

My conjecture is that the root cause of this perceived dichotomy is the preconception that software must scale: that the usefulness of writing and reading with software, and thus the usefulness of code literacy, is limited to tasks that are repetitive and thus subject to automation. However, what if we did not focus on scale for a change? What if we applied the values of close reading (attention, detailism, ...) to programming? What would formalizing the process of close reading in code, as Slow Programming, tell us about a text? This Notebook is an experimental quest towards an answer to that question.

Reproducible acts of scholarly hermeneutics

I contend that hermeneutics and interpretation are not mutually exclusive with code. Software and automation can be used as reductive methods that limit interpretation, or that are only crude hermeneutic means on the level of code, but I assert that they need not be. In fact, if anything, code in its form of literate programming (Knuth 1984) is a meticulously precise description of process. Next to that, code is also 'just text', just another semiotic system (Marino 2006; Van Zundert 2016). Therefore, if we agree that text is an excellent means for reporting hermeneutic process, code should be even better, because it is 'just' text, but with an edge: it will reproduce process meticulously, as long as we capture the hermeneutic process precisely enough. This is what I want to try to do in this experiment: model each and every hermeneutic choice made when reading/editing a text meticulously in code. I will use an Object Oriented approach (Bogost 2009; Object Oriented Programming) to conceptualize the model and create the code. I will furthermore hold to these rules when 'Close Reading' a text using code:

  1. Only direct and indirect speech may be represented as string instances.
  2. There will be no ‘ghost’ objects or methods (unused program code) during the execution of the program.
  3. The resulting program should execute without producing Ruby exceptions or runtime errors.

Chapter 1 — Coding the Transcription

Right after laying down those rules I realized that I had missed something. If I want to hold myself to the meticulously precise registration of a reproducible process, I should not, for instance, transform the 'raw' text by using a text processor (that is: changing the representation by typing and editing the text file), because that requires interpretative action, and all interpretative acts should be modeled or captured in code. It is important in my view that this code is executable computer code. It should express, if not all, then at least as many as possible of my interpretative and transformative acts. With XML, much if not most of my scholarly effort, actions, and performance goes unregistered. Between the manuscript and the TEI tag <p> there is a considerable series of scholarly actions that go unregistered and are lost. The aim here is to see how much of that scholarly performance can actually be captured in code. Every effort I make should be computationally reproducible. Thus I needed a fourth rule for the project:

  1. Only direct and indirect speech may be represented as string instances.
  2. There will be no ‘ghost’ objects or methods (unused program code) during the execution of the program.
  3. The resulting program should execute without producing Ruby exceptions or runtime errors.
  4. All scholarly actions should be computationally reproducible.

So, I can't manually alter or transform the source files I will be using. All scholarly effort shall be expressed in code that guarantees reproducible scholarly actions.
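One way to make the 'no manual alteration' constraint checkable is to record a checksum of a source file once and compare it again before each later step. The following is a minimal sketch under that assumption, not part of the notebook's actual workflow; the method names `fingerprint` and `unaltered?` are hypothetical, while `Digest::SHA256` comes from Ruby's standard library.

```ruby
require 'digest'

# Hypothetical helper: the SHA-256 fingerprint of a file's bytes.
def fingerprint(path)
  Digest::SHA256.file(path).hexdigest
end

# Hypothetical helper: true when the file still matches a fingerprint
# that was recorded earlier.
def unaltered?(path, recorded)
  fingerprint(path) == recorded
end
```

A fingerprint recorded right after downloading a source could then be checked with `unaltered?` at the start of every later cell that reads that file.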

Setting up for the simulation

I will be meticulously programming the editorial workflow as a means of code literacy and code scholarship. I understand scholarly code as code that performs scholarly actions. The digital technology that is practically available to scholars in general is not so advanced that it can perform physical actions (apart from printing, probably). However, all other scholarly actions should be mimicked or simulated as closely as possible and, paramount, should be reproducible. Before we can do so, some setting up is required. This notebook assumes that you have (access to) a computer with this Jupyter Notebook installed. If not, you wouldn't be reading this. The other technical requirements are described as code comments in the next 'cell'. Please follow these requirements closely. This notebook will not work otherwise.


In [1]:
## Requirements

# This notebook requires ImageMagick (an image handling and transformation tool) to be installed.
# If you say this at the command line ($>): $> convert --version
# And it gets you an answer similar to this: 
#    Version: ImageMagick 6.9.3-7 Q16 x86_64 2016-03-27 http://www.imagemagick.org
#    Copyright: Copyright (C) 1999-2016 ImageMagick Studio LLC
#    License: http://www.imagemagick.org/script/license.php
#    Features: Cipher DPC Modules 
#    Delegates (built-in): bzlib freetype jng jpeg ltdl lzma png tiff xml zlib 
# You are good to go
# If not see http://www.imagemagick.org/script/binary-releases.php for installation instructions.

# This notebook requires Tesseract (an OCR engine) to be installed.
# If you say this at the command line ($>): $> tesseract -v
# And it gets you an answer similar to this: 
#    tesseract 3.04.01
#    leptonica-1.73
#    libgif 4.2.3 : libjpeg 9a : libpng 1.6.21 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.5.0 : libopenjp2 2.1.0
# you are good to go. 
# If not see https://github.com/tesseract-ocr/tesseract/wiki for installation instructions.
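The manual check described in the comments above could also be done from within the notebook. The sketch below is only an illustration: the helper name `tool_available?` is hypothetical, and it merely looks for an executable file with the given name in the directories listed in `PATH`; it does not verify versions.

```ruby
# Hypothetical helper: true when an executable file called `name` is found
# in one of the directories of the given search path (PATH by default).
def tool_available?(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Report on the two command line tools this notebook relies on.
%w[convert tesseract].each do |tool|
  puts "#{tool}: #{tool_available?(tool) ? 'found' : 'missing'}"
end
```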

Getting the manuscript

Usually the first task when editing is finding, selecting, and perusing one's sources. Obviously the tasks and actions related to getting to the manuscript and digitally photographing it cannot be captured in this notebook, both because computers are severely handicapped when it comes to performing such actions (there is a long way to go before computers are actually that actionable) and because of time and economic constraints. Luckily, however, the results of such scholarly actions are in the invaluable care of the Württembergische Landesbibliothek Stuttgart (http://www.wlb-stuttgart.de/), and we can emulate 'going to the library and getting the source' by using the online facsimile bank of the Württembergische Landesbibliothek.

For this project I will be using the text of the Middle Dutch fable Of Reynaert the Fox. An extant manuscript of this text is found in the so-called 'Comburg manuscript', which is in the care of the Württembergische Landesbibliothek under the description "Comburger Handschrift - mittelniederländische Sammelhandschrift - Cod.poet.et phil.fol.22". Putting 'Comburger' into the general search will get you to the manuscript. We can now emulate going to the library and getting the manuscript with the little script in the second cell following this paragraph.

The first cell imports into the computing environment the libraries (that is: pieces of code not written by me, but which we need nevertheless) that are needed for the project. Run this cell like any other to make sure the libraries are loaded.


In [1]:
require 'open-uri'


Out[1]:
true

In [2]:
# The maximum zoom level images are statically available from this URL:
# http://digital.wlb-stuttgart.de/filegroups/combha-m_323970265/max/
# The folios 192v-213r coincide with the JPGs 00000388.jpg-00000429.jpg
base_url = "http://digital.wlb-stuttgart.de/filegroups/combha-m_323970265/max/"

side = "r"
folio_number = 192
(388..429).each do |jpgn|

  # Instead of jpg order numbers let's add folio numbering to add a bit of
  # scholarly atmosphere.
  if side.eql?("v")
    side="r"
    folio_number += 1
  else
    side="v"
  end

  folio_name = "#{jpgn}_#{folio_number}#{side}.jpg"
  puts "Downloading #{format("%08d",jpgn)}.jpg => #{folio_name}"
  # File.open writes the local file; URI.open (from open-uri) fetches the remote one.
  File.open( "./resources/#{folio_name}", "wb" ) do |file|
     file << URI.open( "#{base_url}#{format("%08d",jpgn)}.jpg" ).read
  end
end


Downloading 00000388.jpg => 388_192v.jpg
Downloading 00000389.jpg => 389_193r.jpg
Downloading 00000390.jpg => 390_193v.jpg
Downloading 00000391.jpg => 391_194r.jpg
Downloading 00000392.jpg => 392_194v.jpg
Downloading 00000393.jpg => 393_195r.jpg
Downloading 00000394.jpg => 394_195v.jpg
Downloading 00000395.jpg => 395_196r.jpg
Downloading 00000396.jpg => 396_196v.jpg
Downloading 00000397.jpg => 397_197r.jpg
Downloading 00000398.jpg => 398_197v.jpg
Downloading 00000399.jpg => 399_198r.jpg
Downloading 00000400.jpg => 400_198v.jpg
Downloading 00000401.jpg => 401_199r.jpg
Downloading 00000402.jpg => 402_199v.jpg
Downloading 00000403.jpg => 403_200r.jpg
Downloading 00000404.jpg => 404_200v.jpg
Downloading 00000405.jpg => 405_201r.jpg
Downloading 00000406.jpg => 406_201v.jpg
Downloading 00000407.jpg => 407_202r.jpg
Downloading 00000408.jpg => 408_202v.jpg
Downloading 00000409.jpg => 409_203r.jpg
Downloading 00000410.jpg => 410_203v.jpg
Downloading 00000411.jpg => 411_204r.jpg
Downloading 00000412.jpg => 412_204v.jpg
Downloading 00000413.jpg => 413_205r.jpg
Downloading 00000414.jpg => 414_205v.jpg
Downloading 00000415.jpg => 415_206r.jpg
Downloading 00000416.jpg => 416_206v.jpg
Downloading 00000417.jpg => 417_207r.jpg
Downloading 00000418.jpg => 418_207v.jpg
Downloading 00000419.jpg => 419_208r.jpg
Downloading 00000420.jpg => 420_208v.jpg
Downloading 00000421.jpg => 421_209r.jpg
Downloading 00000422.jpg => 422_209v.jpg
Downloading 00000423.jpg => 423_210r.jpg
Downloading 00000424.jpg => 424_210v.jpg
Downloading 00000425.jpg => 425_211r.jpg
Downloading 00000426.jpg => 426_211v.jpg
Downloading 00000427.jpg => 427_212r.jpg
Downloading 00000428.jpg => 428_212v.jpg
Downloading 00000429.jpg => 429_213r.jpg
Out[2]:
388..429
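The loop above derives folio labels by carrying state (`side`, `folio_number`) from one iteration to the next. As an alternative, the same mapping can be sketched as a stateless calculation, assuming (as the cell's comment states) that JPG 388 is folio 192v and that verso and recto sides strictly alternate. The method name `folio_label` and its keyword parameters are hypothetical.

```ruby
# Hypothetical helper: the folio label (e.g. "193r") for a JPG order number.
# Assumes the first JPG is a verso side and that sides strictly alternate.
def folio_label(jpgn, first_jpg: 388, first_folio: 192)
  offset = jpgn - first_jpg
  folio  = first_folio + (offset + 1) / 2   # integer division
  side   = offset.even? ? "v" : "r"
  "#{folio}#{side}"
end

puts folio_label(388)   # prints 192v
puts folio_label(429)   # prints 213r
```

Because the label depends only on the JPG number, any single folio can be named without running through the whole range first.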

Apart from taking the reader to the library myself, I don't think we can get the reader/user closer to the source we want to clarify for her. Let's reproduce just one of the facsimile folios here for a taste:

There is a second extant manuscript of this text, which can be found in the 'Dyckse manuscript' currently held by the Universitäts- und Landesbibliothek Münster. The Münster university library offers excellent facsimiles and accompanying transcriptions. The script above could be adapted to download those facsimiles in high detail too. But as this is out of the scope of this notebook, I will leave that as an exercise to the reader. (An exercise which is complicated, but certainly not made impossible, by the fact that the Münster university library serves its high resolution facsimiles as composites of so-called image tiles. Some skillful customization of the last parameters of the tile URL would get you there.)

(Emulating) transcribing the source

Although Transkribus promises a lot, at the moment computationally recognizing Middle Dutch manuscripts is still a dream. For this reason I am not going to OCR the manuscript, as it is technologically infeasible. But I will also not be transcribing the source manually, because the act of transcribing would not be captured by current computer technology and would not be reproducible (note 1), and hence I would be violating the fourth self-imposed rule.

Luckily there is an existing transcription of this manuscript as part of the open access scholarly edition of the Reynaert that André Bouwman and Bart Besamusca authored (Bouwman & Besamusca 2009). I will use that transcription as the basis of this project. To have a 'simulation', if you will, of the process that would be followed if it were possible to OCR the manuscript, I will download the edition, OCR it, and post-correct and interpret the results. All this must be done in a meticulously reproducible way.

Let's start by downloading the edition, which is available from http://www.oapen.org/search?identifier=340003:


In [3]:
# File.open writes the local file; URI.open (from open-uri) fetches the remote one.
File.open( "./resources/340003.pdf", "wb" ) do |file|
   file << URI.open( "http://www.oapen.org/download?type=document&docid=340003" ).read
end


Out[3]:
#<File:./resources/340003.pdf (closed)>

Now we need to OCR the text to get an emulation of the act of transcription (note 2). We only OCR part of the text (pages 42–58), which will be quite sufficient for our purposes here. This is the text up to and including the defense that Reynaert's cousin, the badger Grimbeert, delivers in King Noble's court, where Reynaert, absent and thereby showing his contempt for the court, is indicted. It is a key moment. When Grimbeert finishes his ardent plea, the beheaded body of a hen named Coppe is carried into the court on a bier.


In [4]:
file_name = "./resources/340003.pdf" # A handle to the PDF document

text = "" # This will contain the OCR'ed text

# Let's notify the user that we're working
print "‘Transcribing’ page "

# We'll do a transcription for each page in the range 42..58
(42..58).each do |page_number|
  
  # Let's notify the user on which page we're working
  print "#{page_number > 42 ? ', ' : ''}#{page_number}"
  
  # We 'lift' each page from the pdf by using the command line tool 'convert' and make sure it has sufficient
  # resolution (300dpi) and color depth for scanning.
  page_image = `convert -depth 8 -density 300 -background white +matte #{file_name}[#{page_number}] tiff:-`
  
  # We use the 'tesseract' command line tool to OCR each page and add the OCR'ed text to the variable 'text'
  IO.popen("tesseract stdin stdout txt", "r+") do |io|
    io.write page_image
    io.close_write
    text << io.read
  end

end


‘Transcribing’ page 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58
Out[4]:
42..58
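The backtick call above interpolates the file name into a shell command string, which works as long as the path contains no spaces or other shell metacharacters. A hedged alternative is to pass the command as an argument list so that no shell is involved. In the sketch below the helper names `page_image_command` and `page_image` are hypothetical, the flags are copied unchanged from the cell above, and `Open3.capture2` comes from Ruby's standard library.

```ruby
require 'open3'

# Hypothetical helper: the argument list for lifting one PDF page as a TIFF.
# The flags are copied unchanged from the cell above.
def page_image_command(pdf_path, page_number)
  ["convert", "-depth", "8", "-density", "300", "-background", "white",
   "+matte", "#{pdf_path}[#{page_number}]", "tiff:-"]
end

# Hypothetical helper: runs the command without a shell and returns the
# image bytes, raising an error when convert reports a failure.
def page_image(pdf_path, page_number)
  image, status = Open3.capture2(*page_image_command(pdf_path, page_number), binmode: true)
  raise "convert failed on page #{page_number}" unless status.success?
  image
end
```

Checking the exit status also means that a missing page or a missing ImageMagick installation surfaces as an error instead of silently producing empty OCR input.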

To verify let's check the contents of the first page.


In [5]:
puts text[0..1800]


TEXT, TRANSLATION AND NOTES

 

42

1O

20

25

Willem die Madocke maecte, [192va,22]
daer hi dicken omme waecte,

hem vernoyde so haerde

dat die avonture van Reynaerde

in Dietsche onghemaket bleven

— die Arnout niet hevet vulscreven —
dat hi die vijte dede soucken

ende hise na den Walschen boucken
in Dietsche dus hevet begonnen.
God moete ons ziere hulpen jonnen!
Nu keert hem daertoe mijn zin

dat ic bidde in dit beghin

beede den dorpren enten doren,

ofte si commen daer si horen

dese rijme ende dese woort

(die hem onnutte sijn ghehoort),

dat sise laten onbescaven.

Te vele slachten si den raven,

die emmer es al even malsch.

Si maken sulke rijme valsch,

daer si niet meer of ne weten [192vb]
dan ic doe hoe dat si heeten

die nu in Babilonien leven.

Daden si wel, si soudens begheven.
Dat en segghic niet dor minen wille.
Mijns dichtens ware een ghestille,
ne hads mi eene niet ghebeden

die in groeter hovesscheden

1 AMiddle Dutch story about Madoc has not come down to us, but there are strong indi—
cations that a work with this title did at one time exist. Willem’s earlier tale probably told of a
dream that Madoc had, as seems to be suggested in Maerlant’s Rijmbijbel (cf. p. 16). Madoc is

sometimes considered to have been a story about a seafarer’s adventures.

6 It has been suggested that Van den v05 Reynaerde was written by two poets and that Wil—
lem completed Arnout’s unfinished work. However, serious objections may be raised to this
notion of joint authorship. Assuming that the name was not an invention, it seems probable,
also in view of the emphatic Walsch—Dietsch (French—Dutch) contrast in the lines before and

after the name, that Arnout was a French Renartpoet (cf. p. 15).
13 dorpren (‘peasants’) refers to non—courtly persons.

PROLOGUE

 

1O

20

25

There, we have a transcription. We see a page number, some line numbers, Middle Dutch text, English footnotes, and the header and line numbers of the next page. OCR isn't perfect, as is demonstrated by the number 10 being recognized as the digit 1 followed by the letter O. However, it gives us the glyphs of the edition at least.
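To give a first impression of what correcting such a misreading in code might look like, here is a minimal sketch of one possible rule. It is an assumption made for illustration, not necessarily the correction applied later in this notebook: a line that starts with a digit and otherwise contains only digits and capital letter O's is taken to be a misread line number. The method name `repair_line_number` is hypothetical.

```ruby
# Hypothetical rule: a line that starts with a digit and consists only of
# digits and capital O's is treated as a line number with misread zeros.
def repair_line_number(line)
  line.match?(/\A\d[\dO]*\z/) ? line.tr("O", "0") : line
end

puts repair_line_number("1O")                                      # prints 10
puts repair_line_number("Willem die Madocke maecte, [192va,22]")   # unchanged
```

The rule is deliberately narrow: it leaves every line of verse and every footnote untouched, so the interpretative choice it encodes stays small and inspectable.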

From this material we shall build the computational edition of the scholarly edition. But this should not be a digital metaphor of the book. For that, the downloaded PDF is available, and that is probably good enough for my uses as a reader. This is a point where we could actually leave off. We could say: this is what current computer technology can do, that is, guess at the glyphs. That is as far as computer science understands text. Or rather, we should contend that this is a form and an amount of human knowledge about text that has been expressed computationally. However, we can use computer language to upgrade that understanding of the text by transferring my knowledge about it to computer code. Here we are interested in what it means to transfer the text and knowledge about it to code. What does it mean to read it closely through code? Maybe even: what does it mean to read text as code?

To explore this we need to take our code reading of the text further, making sure that we capture each and every step of the process in code. In the next chapter we will 'clean up' our transcription, making sure we have the actual Middle Dutch text. We then proceed to reading the text through code.

Oh... but let's not forget to write our text to a file, so we can actually reuse it.


In [6]:
File.open( "./resources/Bouwman_ Of Reynaert the Fox.txt", 'w') {|file| file.write( text ) }


Out[6]:
27658

Notes

1) Very strictly speaking this is not true. I could have installed a key logger and registered every transcription action that way, probably even including the look-ups and queries I would have initiated from behind my keyboard. Because of time constraints this is not feasible for this notebook, however. Also, the focus of this notebook is the close reading of the text of the source through code, for which the transcription is a preliminary step that can be simulated in its reproducibility, as the following section of the notebook shall demonstrate.

2) Digitally informed observers might interrupt here: "Why not just use a PDF-to-text converter like pdftotext and bypass the probably less precise and CPU-intensive OCR process?" This, however, would not capture and reproduce the act of transcription as faithfully as possible. Ideally an OCR engine would be able to guess decently at the glyphs of a medieval manuscript, which is what we mimic or simulate here. However, it would never be possible to extract the text from a facsimile image using an algorithm such as the one pdftotext provides, simply because the image does not contain computer-readable text. It only contains pictorial data.

References

Bogost, I., 2009. What is Object-Oriented Ontology? A definition for ordinary folk. Ian Bogost. Available at: http://bogost.com/writing/blog/what_is_objectoriented_ontolog/ [Accessed May 18, 2016].

Bouwman, A. & Besamusca, B., 2009. Of Reynaert the Fox: Text and Facing Translation of the Middle Dutch Beast Epic Van den vos Reynaerde, Amsterdam: Amsterdam University Press. Available at: http://www.oapen.org/search?identifier=340003 [Accessed November 20, 2015].

Dalen-Oskam, K. van & Zundert, J.J. van, 2007. Delta for Middle Dutch: Author and copyist distinction in “Walewein.” Literary and Linguistic Computing, 22(3), pp.345–362.

Hayles, K.N., 2012. How We Think: Digital Media and Contemporary Technogenesis, Chicago (US): University of Chicago Press.

Kestemont, M., 2012. Het gewicht van de auteur. Een onderzoek naar stylometrische auteursherkenning in de Middelnederlandse epiek. Universiteit Antwerpen, Faculteit Letteren en Wijsbegeerte, Departementen Taal- en Letterkunde.

Knuth, D.E., 1984. Literate Programming. The Computer Journal, 27(1), pp.97–111.

Marino, M.C., 2006. Critical Code Studies. Electronic Book Review. Available at: http://www.electronicbookreview.com/thread/electropoetics/codology [Accessed January 16, 2015].

Meister, J.C., 1995. Consensus ex Machina? Consensus qua Machina! Literary and Linguistic Computing, 10(4), pp.263–270.

Object Oriented Programming. Wikipedia, the free encyclopedia. Available at: https://en.wikipedia.org/wiki/Object-oriented_programming [Accessed May 18, 2016].

Ramsay, S., 2011. Reading Machines: Toward an Algorithmic Criticism (Topics in the Digital Humanities), Chicago (US): University of Illinois Press.

Underwood, T. & Sellers, J., 2015. How Quickly Do Literary Standards Change? Available at: http://figshare.com/articles/How_Quickly_Do_Literary_Standards_Change_/1418394 [Accessed December 16, 2015].

Zundert, J.J. van, 2016. Editor, Author, Engineer: Transformation of Authorship in Scholarly Editing? Interdisciplinary Science Reviews, 41(1), [forthcoming].
